Homogeneity Score (homogeneity_score)#
Homogeneity is an external clustering metric: it scores how pure each predicted cluster is with respect to ground-truth class labels.
Intuition: If I open a cluster, do I mostly see one class?
Perfectly pure clusters → score = 1.0
Completely mixed clusters (clusters don’t help predict the class) → score ≈ 0.0
Learning goals#
By the end you should be able to:
explain homogeneity in terms of entropy
compute it from a contingency matrix (class × cluster counts)
implement homogeneity_score from scratch in NumPy
visualize what increases / decreases the score
use it to tune a simple clustering algorithm (with caveats)
Quick import#
from sklearn.metrics import homogeneity_score
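A quick sanity check of the intuition above, using the two extreme cases:

```python
from sklearn.metrics import homogeneity_score

# Perfectly pure clusters: each cluster holds a single class
h_pure = homogeneity_score([0, 0, 1, 1], [0, 0, 1, 1])

# Fully mixed clusters: each cluster is a 50/50 blend of both classes
h_mixed = homogeneity_score([0, 0, 1, 1], [0, 1, 0, 1])

print(h_pure, h_mixed)  # 1.0 0.0
```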
Table of contents#
Intuition: purity vs completeness
The math: entropy & conditional entropy
NumPy implementation (from scratch)
Worked toy example + plots
How mixing affects homogeneity
Pitfall: over-segmentation
Using homogeneity to tune k-means (grid search)
Pros/cons + when to use
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
completeness_score as sk_completeness_score,
homogeneity_score as sk_homogeneity_score,
v_measure_score as sk_v_measure_score,
)
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)
1) Intuition: purity vs completeness#
Homogeneity cares about purity inside each predicted cluster.
If a cluster contains multiple ground-truth classes, it’s impure → homogeneity goes down.
If a ground-truth class gets split across many clusters, homogeneity does not complain.
That second point is why homogeneity is often paired with completeness:
Homogeneity: each cluster contains only members of a single class.
Completeness: all members of a given class are assigned to the same cluster.
Both together are summarized by the V-measure (harmonic mean).
A key property: homogeneity is label-permutation invariant. If you relabel clusters (e.g., swap cluster 0 and 1), the score doesn’t change.
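A minimal check of the permutation invariance: relabeling the clusters leaves the contingency counts (and hence the score) unchanged.

```python
import numpy as np
from sklearn.metrics import homogeneity_score

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])

# Swap cluster ids 0 <-> 1: a pure relabeling of the clusters
h_original = homogeneity_score(y_true, y_pred)
h_relabeled = homogeneity_score(y_true, 1 - y_pred)

print(h_original, h_relabeled)  # the two scores match
```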
2) The math: entropy & conditional entropy#
We have:
ground-truth class labels: \(c \in \{1,\dots,C\}\) (random variable \(C\))
predicted cluster labels: \(k \in \{1,\dots,K\}\) (random variable \(K\))
2.1 Contingency matrix#
Let the contingency matrix \(N \in \mathbb{N}^{C\times K}\) count co-occurrences:
\[
N_{c,k} = \#\{\, i : \text{class}(i) = c \ \text{and} \ \text{cluster}(i) = k \,\}.
\]
Define totals:
\(n = \sum_{c,k} N_{c,k}\)
class counts: \(n_c = \sum_k N_{c,k}\)
cluster counts: \(n_k = \sum_c N_{c,k}\)
2.2 Entropy#
The entropy of the class variable is
\[
H(C) = -\sum_{c=1}^{C} \frac{n_c}{n} \log \frac{n_c}{n}.
\]
2.3 Conditional entropy#
The conditional entropy of classes given clusters is
\[
H(C \mid K) = -\sum_{k=1}^{K} \sum_{c=1}^{C} \frac{N_{c,k}}{n} \log p(c \mid k),
\]
where
\[
p(c \mid k) = \frac{N_{c,k}}{n_k}.
\]
2.4 Homogeneity score#
Homogeneity is defined as
\[
h = 1 - \frac{H(C \mid K)}{H(C)}.
\]
Edge case: if \(H(C)=0\) (all points belong to one class), homogeneity is defined as 1.0.
Interpretation:
\(H(C\mid K)=0\) ⇒ each cluster determines the class perfectly ⇒ \(h=1\)
\(H(C\mid K)=H(C)\) ⇒ clusters tell you nothing about the class ⇒ \(h=0\)
Note: the log base cancels in the ratio, so you can use natural log.
A nice identity (using mutual information \(I(C;K) = H(C) - H(C \mid K)\)):
\[
h = \frac{I(C;K)}{H(C)}.
\]
So homogeneity is the fraction of class entropy explained by the clustering.
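We can check this identity numerically with scikit-learn's mutual_info_score, which uses the natural log (the base cancels in the ratio, as noted above):

```python
import numpy as np
from sklearn.metrics import homogeneity_score, mutual_info_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=300)
y_pred = rng.integers(0, 5, size=300)

# H(C) from the class counts, natural log
_, counts = np.unique(y_true, return_counts=True)
p = counts / counts.sum()
H_C = float(-(p * np.log(p)).sum())

# h = I(C;K) / H(C)
h_via_mi = mutual_info_score(y_true, y_pred) / H_C
h_direct = homogeneity_score(y_true, y_pred)
print(abs(h_via_mi - h_direct))  # ~0 (floating-point noise only)
```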
3) NumPy implementation (from scratch)#
We’ll implement:
a contingency matrix builder (any label types)
entropy + conditional entropy from counts
homogeneity_score using the definition above
def encode_labels(y):
'''Map arbitrary labels to integer ids 0..(m-1).'''
y = np.asarray(y)
classes, y_idx = np.unique(y, return_inverse=True)
return classes, y_idx
def contingency_matrix_np(y_true, y_pred):
'''Contingency matrix N with N[c,k] = count(true=c, pred=k).'''
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError("y_true and y_pred must have the same shape")
true_labels, true_idx = encode_labels(y_true)
pred_labels, pred_idx = encode_labels(y_pred)
n_classes = true_labels.size
n_clusters = pred_labels.size
N = np.zeros((n_classes, n_clusters), dtype=int)
np.add.at(N, (true_idx, pred_idx), 1)
return N, true_labels, pred_labels
def entropy_from_counts(counts: np.ndarray) -> float:
'''Shannon entropy of a discrete distribution given counts.'''
counts = np.asarray(counts, dtype=float)
total = counts.sum()
if total <= 0:
return 0.0
p = counts[counts > 0] / total
return float(-(p * np.log(p)).sum())
def conditional_entropy_C_given_K_from_contingency(N: np.ndarray) -> float:
'''Compute H(C|K) from contingency matrix N (classes x clusters).'''
N = np.asarray(N, dtype=float)
n = N.sum()
if n <= 0:
return 0.0
n_k = N.sum(axis=0, keepdims=True) # (1, K)
# H(C|K) = - sum_{c,k} p(c,k) log p(c|k)
with np.errstate(divide="ignore", invalid="ignore"):
p_ck = N / n
# out= is needed: without it, entries where n_k == 0 are left uninitialized
p_c_given_k = np.divide(N, n_k, out=np.zeros_like(N), where=n_k > 0)
terms = np.where(N > 0, p_ck * np.log(p_c_given_k), 0.0)
return float(-terms.sum())
def homogeneity_score_np(y_true, y_pred) -> float:
'''Homogeneity score in [0,1]. Matches sklearn's definition.'''
N, _, _ = contingency_matrix_np(y_true, y_pred)
H_C = entropy_from_counts(N.sum(axis=1))
if H_C == 0.0:
return 1.0
H_C_given_K = conditional_entropy_C_given_K_from_contingency(N)
h = 1.0 - H_C_given_K / H_C
# Numerical safety
return float(np.clip(h, 0.0, 1.0))
# Quick sanity check vs scikit-learn
y_true = rng.integers(0, 4, size=500)
y_pred = rng.integers(0, 7, size=500)
h_np = homogeneity_score_np(y_true, y_pred)
h_sk = sk_homogeneity_score(y_true, y_pred)
print("homogeneity (numpy): ", h_np)
print("homogeneity (sklearn):", h_sk)
print("abs diff:", abs(h_np - h_sk))
# Edge case: one true class -> defined as 1.0
print("one-class edge case:", homogeneity_score_np(np.zeros(20), rng.integers(0, 3, size=20)))
homogeneity (numpy): 0.007663540181941153
homogeneity (sklearn): 0.007663540181940872
abs diff: 2.8102520310824275e-16
one-class edge case: 1.0
4) Worked toy example + plots#
Let’s build a small example and look at:
the contingency matrix
per-cluster class proportions
per-cluster class entropy (how “mixed” each cluster is)
y_true_toy = np.array([
"A", "A", "A", "A", "A",
"B", "B", "B", "B",
"C", "C", "C", "C",
])
# Clusters are somewhat mixed:
# - cluster 0: mostly A
# - cluster 1: mix of A and B
# - cluster 2: pure C
# - cluster 3: mix of B and C
y_pred_toy = np.array([
0, 0, 0, 1, 1,
1, 1, 3, 3,
2, 2, 2, 3,
])
N_toy, classes_toy, clusters_toy = contingency_matrix_np(y_true_toy, y_pred_toy)
h_toy = homogeneity_score_np(y_true_toy, y_pred_toy)
print("classes:", classes_toy)
print("clusters:", clusters_toy)
print("contingency N (rows=class, cols=cluster):")
print(N_toy)
print("homogeneity:", h_toy)
fig = px.imshow(
N_toy,
x=[f"cluster {k}" for k in clusters_toy],
y=[f"class {c}" for c in classes_toy],
text_auto=True,
color_continuous_scale="Blues",
title=f"Toy contingency matrix (homogeneity={h_toy:.3f})",
labels={"x": "predicted cluster", "y": "true class", "color": "count"},
)
fig.update_layout(coloraxis_showscale=False)
fig.show()
classes: ['A' 'B' 'C']
clusters: [0 1 2 3]
contingency N (rows=class, cols=cluster):
[[3 2 0 0]
[0 2 0 2]
[0 0 3 1]]
homogeneity: 0.6704302058675669
# Per-cluster class proportions and per-cluster entropy
cluster_sizes = N_toy.sum(axis=0)
proportions = np.divide(N_toy, cluster_sizes, out=np.zeros(N_toy.shape), where=cluster_sizes > 0)
cluster_entropies = np.array([entropy_from_counts(N_toy[:, k]) for k in range(N_toy.shape[1])])
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Class proportions within each cluster", "Entropy within each cluster"),
)
# stacked bars (proportions)
for i, c in enumerate(classes_toy):
fig.add_trace(
go.Bar(
x=[f"cluster {k}" for k in clusters_toy],
y=proportions[i],
name=f"class {c}",
),
row=1,
col=1,
)
fig.update_yaxes(title_text="proportion", range=[0, 1], row=1, col=1)
fig.update_xaxes(title_text="cluster", row=1, col=1)
# entropies
fig.add_trace(
go.Bar(
x=[f"cluster {k}" for k in clusters_toy],
y=cluster_entropies,
name="entropy",
marker_color="gray",
),
row=1,
col=2,
)
fig.update_yaxes(title_text="H(C | K=k)", row=1, col=2)
fig.update_xaxes(title_text="cluster", row=1, col=2)
fig.update_layout(barmode="stack", title_text="What makes homogeneity go up/down")
fig.show()
5) How mixing affects homogeneity#
Consider a binary problem with two equally common classes.
We’ll create cluster labels by copying the true labels and then flipping a fraction \(\varepsilon\) of them.
\(\varepsilon = 0\) ⇒ perfectly pure clusters ⇒ homogeneity = 1
larger \(\varepsilon\) ⇒ more mixing inside clusters ⇒ homogeneity drops
def flip_fraction(y, eps: float, rng: np.random.Generator) -> np.ndarray:
y = np.asarray(y, dtype=int)
if not (0.0 <= eps <= 1.0):
raise ValueError("eps must be in [0,1]")
y_pred = y.copy()
flip = rng.random(size=y.size) < eps
y_pred[flip] = 1 - y_pred[flip]
return y_pred
n = 2000
# perfectly balanced classes
true_bin = np.r_[np.zeros(n // 2, dtype=int), np.ones(n // 2, dtype=int)]
rng.shuffle(true_bin)
eps_grid = np.linspace(0.0, 0.5, 51)
h_values = []
for eps in eps_grid:
pred_bin = flip_fraction(true_bin, eps=float(eps), rng=rng)
h_values.append(homogeneity_score_np(true_bin, pred_bin))
fig = go.Figure()
fig.add_trace(go.Scatter(x=eps_grid, y=h_values, mode="lines+markers", name="homogeneity"))
fig.update_layout(
title="Homogeneity vs label mixing (binary flip noise)",
xaxis_title="flip fraction ε",
yaxis_title="homogeneity",
yaxis_range=[0, 1.02],
)
fig.show()
6) Pitfall: over-segmentation can reach 1.0#
Homogeneity ignores whether a class is split across many clusters.
If each class is divided into multiple sub-clusters (all pure), homogeneity stays 1.0, even though the clustering is often less useful.
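The pitfall in miniature: put every point in its own (trivially pure) singleton cluster.

```python
from sklearn.metrics import completeness_score, homogeneity_score

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 2, 3]  # every point in its own cluster: all clusters are pure

h = homogeneity_score(y_true, y_pred)
c = completeness_score(y_true, y_pred)
print(h, c)  # homogeneity is 1.0, completeness only 0.5
```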
We’ll demonstrate this by taking \(C=3\) classes and splitting each class into \(m\) pure clusters.
We’ll also show completeness and V-measure for contrast.
C = 3
n_per_class = 400
y_true = np.repeat(np.arange(C), n_per_class)
rng.shuffle(y_true)
def split_each_class_into_m_clusters(y_true, m: int, rng: np.random.Generator) -> np.ndarray:
y_true = np.asarray(y_true, dtype=int)
y_pred = np.empty_like(y_true)
for c in range(np.max(y_true) + 1):
idx = np.where(y_true == c)[0]
sub = rng.integers(0, m, size=idx.size)
y_pred[idx] = c * m + sub
return y_pred
m_grid = np.arange(1, 21)
h_list = []
comp_list = []
v_list = []
for m in m_grid:
y_pred = split_each_class_into_m_clusters(y_true, m=int(m), rng=rng)
h_list.append(homogeneity_score_np(y_true, y_pred))
comp_list.append(sk_completeness_score(y_true, y_pred))
v_list.append(sk_v_measure_score(y_true, y_pred))
fig = go.Figure()
fig.add_trace(go.Scatter(x=m_grid, y=h_list, mode="lines+markers", name="homogeneity"))
fig.add_trace(go.Scatter(x=m_grid, y=comp_list, mode="lines+markers", name="completeness"))
fig.add_trace(go.Scatter(x=m_grid, y=v_list, mode="lines+markers", name="v-measure"))
fig.update_layout(
title="Over-segmentation: splitting each class into m pure clusters",
xaxis_title="m (clusters per true class)",
yaxis_title="score",
yaxis_range=[0, 1.02],
)
fig.show()
7) Using homogeneity to tune k-means (grid search)#
Homogeneity is not differentiable w.r.t. model parameters (it depends on discrete assignments), so you normally use it for:
comparing clustering algorithms
selecting hyperparameters (like number of clusters \(k\))
Below is a tiny NumPy k-means implementation and a grid search over \(k\).
We’ll see an important behavior:
as \(k\) increases, homogeneity often increases (sometimes monotonically)
So optimizing for homogeneity alone tends to push toward larger \(k\) unless you constrain \(k\) or pair it with completeness / V-measure.
def kmeans_fit_predict_np(X: np.ndarray, k: int, n_iters: int = 50, seed: int = 0):
'''Simple k-means (Lloyd) implementation. Returns labels and centroids.'''
X = np.asarray(X, dtype=float)
n, d = X.shape
if not (1 <= k <= n):
raise ValueError("k must be in [1, n]")
rng_local = np.random.default_rng(seed)
# init: choose k random points as centroids
centroids = X[rng_local.choice(n, size=k, replace=False)].copy()
labels = np.full(n, -1, dtype=int)
for _ in range(n_iters):
# squared distances to each centroid (n, k)
d2 = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
new_labels = np.argmin(d2, axis=1)
if np.array_equal(new_labels, labels):
break
labels = new_labels
# update step
for j in range(k):
mask = labels == j
if np.any(mask):
centroids[j] = X[mask].mean(axis=0)
else:
# empty cluster: re-seed to a random point
centroids[j] = X[rng_local.integers(0, n)]
return labels, centroids
# Dataset with known classes (so we can compute external metrics)
X, y_true = make_blobs(
n_samples=1500,
centers=3,
n_features=2,
cluster_std=[1.0, 1.2, 0.9],
random_state=3,
)
fig = px.scatter(
x=X[:, 0],
y=X[:, 1],
color=y_true.astype(str),
title="Ground-truth classes (for evaluation)",
labels={"x": "x1", "y": "x2", "color": "true class"},
)
fig.show()
# Compare different k-means clusterings visually
def plot_clustering(X, labels, title: str):
fig = px.scatter(
x=X[:, 0],
y=X[:, 1],
color=labels.astype(str),
title=title,
labels={"x": "x1", "y": "x2", "color": "cluster"},
)
fig.show()
for k in [2, 3, 6]:
km = KMeans(n_clusters=k, n_init=10, random_state=0)
y_pred = km.fit_predict(X)
h = homogeneity_score_np(y_true, y_pred)
plot_clustering(X, y_pred, title=f"KMeans k={k} (homogeneity={h:.3f})")
# Grid search over k and random seeds (using the NumPy k-means above)
k_values = np.arange(2, 13)
seeds = np.arange(0, 15)
rows = []
for k in k_values:
for seed in seeds:
labels, _ = kmeans_fit_predict_np(X, k=int(k), n_iters=80, seed=int(seed))
rows.append(
{
"k": int(k),
"seed": int(seed),
"homogeneity": homogeneity_score_np(y_true, labels),
"completeness": sk_completeness_score(y_true, labels),
"v_measure": sk_v_measure_score(y_true, labels),
}
)
# best seed per k (by homogeneity)
best_by_k = {}
for r in rows:
k = r["k"]
if (k not in best_by_k) or (r["homogeneity"] > best_by_k[k]["homogeneity"]):
best_by_k[k] = r
best_rows = [best_by_k[k] for k in k_values]
best_k_by_h = max(best_rows, key=lambda r: r["homogeneity"])["k"]
print("best k by homogeneity:", best_k_by_h)
fig = go.Figure()
# scatter all runs
fig.add_trace(
go.Scatter(
x=[r["k"] for r in rows],
y=[r["homogeneity"] for r in rows],
mode="markers",
name="homogeneity (all seeds)",
marker=dict(size=6, opacity=0.35),
)
)
# line: best homogeneity per k
fig.add_trace(
go.Scatter(
x=[r["k"] for r in best_rows],
y=[r["homogeneity"] for r in best_rows],
mode="lines+markers",
name="best homogeneity per k",
)
)
# lines: completeness and v-measure for the same best runs
fig.add_trace(
go.Scatter(
x=[r["k"] for r in best_rows],
y=[r["completeness"] for r in best_rows],
mode="lines+markers",
name="completeness (same best-by-h runs)",
)
)
fig.add_trace(
go.Scatter(
x=[r["k"] for r in best_rows],
y=[r["v_measure"] for r in best_rows],
mode="lines+markers",
name="v-measure (same best-by-h runs)",
)
)
fig.add_vline(
x=best_k_by_h,
line_dash="dash",
line_color="gray",
annotation_text=f"best k by homogeneity: {best_k_by_h}",
)
fig.update_layout(
title="Selecting k by homogeneity (watch the over-segmentation bias)",
xaxis_title="k",
yaxis_title="score",
yaxis_range=[0, 1.02],
)
fig.show()
best k by homogeneity: 7
8) Pros/cons + when to use#
Pros#
Interpretable: “cluster purity” aligned with many real use cases
Bounded in [0, 1] and label-permutation invariant
Works for multiclass and imbalanced class distributions
Information-theoretic: connects to entropy and mutual information
Cons / pitfalls#
Requires ground-truth labels (so it’s not usable for truly unsupervised evaluation)
Ignores completeness → can be artificially high with many clusters (over-segmentation)
Not a smooth/differentiable objective (used for evaluation / selection, not gradient training)
Can hide issues if small impure clusters exist but are tiny (weighted by cluster size)
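The last pitfall in a small sketch: with balanced classes, a fully mixed cluster holding 1% of the points only costs about 1% of the score.

```python
import numpy as np
from sklearn.metrics import homogeneity_score

# Two large pure clusters plus one tiny, fully mixed cluster (10 of 1000 points)
y_true = np.r_[np.zeros(495, dtype=int), np.ones(495, dtype=int), np.repeat([0, 1], 5)]
y_pred = np.r_[np.zeros(495, dtype=int), np.ones(495, dtype=int), np.full(10, 2)]

h = homogeneity_score(y_true, y_pred)
print(h)  # ~0.99: the impure cluster is down-weighted by its size
```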
Good use cases#
Benchmarking clustering when you have a gold standard (topics, categories, known segments)
Situations where mixing classes inside a cluster is especially harmful (you need “clean buckets”)
As part of V-measure (homogeneity + completeness) or alongside other external metrics (ARI, AMI)
References#
Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure.
scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html
Related metrics:
completeness_score, v_measure_score, adjusted_rand_score, adjusted_mutual_info_score
Exercises#
Create a clustering with high homogeneity but low completeness. Verify with plots.
Modify the toy example so one small cluster is very impure. How much does homogeneity change?
Implement completeness_score from scratch and reproduce V-measure.